Mark Dunning, Suraj Menon and Aiora Zabala. Original material by Robert Stojnić, Laurent Gatto, Rob Foy John Davey, Dávid Molnár and Ian Roberts
Last modified: 17 Mar 2016
https://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919
http://www.revolutionanalytics.com/companies-using-r
Sidney Harris - New York Times
R)To launch RStudio, find the RStudio icon in the menu bar on the left of the screen and click
2 + 2
[1] 4
20/5 - sqrt(25) + 3^2
[1] 8
sin(pi/2)
[1] 1
Note: The number in the square brackets is an indicator of the position in the output. In this case the output is a ‘vector’ of length 1 (i.e. a single number). More on vectors coming up…
<-x <- 10
x
[1] 10
myNumber <- 25
myNumber
[1] 25
sqrt(myNumber)
[1] 5
x + myNumber
[1] 35
x <- 21
x
[1] 21
x <- myNumber
x
myNumber <- myNumber + sqrt(16)
myNumber
sin(x)
sum(3,4,5,6)
[1] 18
max(3,4,5,6)
[1] 6
min(3,4,5,6)
[1] 3
seq(from = 2, to = 20, by = 4)
[1] 2 6 10 14 18
seq(2, 20, 4)
[1] 2 6 10 14 18
c combines its arguments into a vector:x <- c(3,4,5,6)
x
[1] 3 4 5 6
[] indicate the position within the vector (the index).[] notation:x[1]
x[4]
y <- c(2,3)
x[y]
[1] 4 5
x <- c(3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
x <- 3:12
x
[1] 3 4 5 6 7 8 9 10 11 12
seq() function, which returns a vector:x <- seq(2, 20, 4)
x
[1] 2 6 10 14 18
x <- seq(2, 20, length.out=5)
x
[1] 2.0 6.5 11.0 15.5 20.0
rep() function:y <- rep(3, 5)
y
[1] 3 3 3 3 3
y <- rep(1:3, 5)
y
[1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
x <- 3:12
# Extract elements from x:
x[3:7]
[1] 5 6 7 8 9
x[seq(2, 6, 2)]
[1] 4 6 8
x[rep(3, 2)]
[1] 5 5
y <- c(x, 1)
y
[1] 3 4 5 6 7 8 9 10 11 12 1
z <- c(x, y)
z
[1] 3 4 5 6 7 8 9 10 11 12 3 4 5 6 7 8 9 10 11 12 1
x <- 3:12
x[-3]
[1] 3 4 6 7 8 9 10 11 12
x[-(5:7)]
[1] 3 4 5 6 10 11 12
x[-seq(2, 6, 2)]
[1] 3 5 7 9 10 11 12
x[6] <- 4
x
x[3:5] <- 1
x
Remember!
x <- 1:10
y <- x*2
y
[1] 2 4 6 8 10 12 14 16 18 20
z <- x^2
z
[1] 1 4 9 16 25 36 49 64 81 100
y + z
[1] 3 8 15 24 35 48 63 80 99 120
x + 1:2
[1] 2 4 4 6 6 8 8 10 10 12
x + 1:3
Warning in x + 1:3: longer object length is not a
multiple of shorter object length
[1] 2 4 6 5 7 9 8 10 12 11
gene.names <- c("Pax6", "Beta-actin", "FoxP2", "Hox9")
gene.names
[1] "Pax6" "Beta-actin" "FoxP2"
[4] "Hox9"
names() function, which can be useful to keep track of the meaning of our data:gene.expression <- c(0, 3.2, 1.2, -2)
names(gene.expression) <- gene.names
gene.expression
Pax6 Beta-actin FoxP2 Hox9
0.0 3.2 1.2 -2.0
names() function to get a vector of the names of an object:names(gene.expression)
[1] "Pax6" "Beta-actin" "FoxP2"
[4] "Hox9"
| Species | Genome size (Mb) | Protein coding genes |
|---|---|---|
| Homo sapiens | 3,102 | 20,774 |
| Mus musculus | 2,731 | 23,139 |
| Drosophila melanogaster | 169 | 13,937 |
| Caenorhabditis elegans | 100 | 20,532 |
| Saccharomyces cerevisiae | 12 | 6,692 |
c function. Create a species.name vector and use this vector to name the values in the other two vectors.coding.bases.coding.bases and genome.size vectors to calculate this. (See earlier slides for how to do division in R.)genome.size <- c(3102, 2731, 169, 100, 12)
coding.genes <- c(20774, 23139, 13937, 20532, 6692)
species.name <- c("H. sapiens","M. musculus",
"D. melanogaster","C. elegans",
"S. cerevisiae")
names(genome.size) <- species.name
names(coding.genes) <- species.name
coding.bases <- coding.genes*0.0015
coding.bases
H. sapiens M. musculus D. melanogaster
31.1610 34.7085 20.9055
C. elegans S. cerevisiae
30.7980 10.0380
coding.pc <- coding.bases/genome.size*100
coding.pc
H. sapiens M. musculus D. melanogaster
1.004545 1.270908 12.370118
C. elegans S. cerevisiae
30.798000 83.650000
coding.bases[1]/coding.bases[5]
H. sapiens
3.104304
genome.size[1]/genome.size[5]
H. sapiens
258.5
coding.pc) but sometimes it is not (when we are comparing human to yeast). We can remove names by setting them to the special NULL value:names(coding.pc) <- NULL
coding.pc
[1] 1.004545 1.270908 12.370118 30.798000
[5] 83.650000
Typing lots of commands directly to R can be tedious. A better way is to write the commands to a file and then load it into R.
solution-exercise1.Rmd for solution to Exercise 1? followed by the function name. For example:?seq
example function:example(seq)
?? followed by your guess. R will return a list of possibilities:??plot
; end of line (Enables multiple commands to be placed on one line of text)# comment (indicates text is a comment and not executed)+ command line wrap (R is waiting for you to complete an expression)sum() is in the base package and sd(), which calculates the standard deviation of a vector, is in the stats packageinstall.packages()
install.packages(name.of.my.package)
source("http://bioconductor.org/biocLite.R")
biocLite() function:biocLite("PackageName")
install.packages() function to install it:install.packages("ggplot2")
DESeq is a Bioconductor package (http://www.bioconductor.org) for the analysis of RNA-seq data:source("http://www.bioconductor.org/biocLite.R")
biocLite("DESeq")
library(...) function to load the newly installed features:library(ggplot2) # Loads ggplot functions
library(DESeq) # Loads DESeq functions
library() # Lists all the packages installed
| First_Name | Second_Name | Full_Name | Sex | Age | Weight | Consent |
|---|---|---|---|---|---|---|
| Adam | Jones | Adam Jones | Male | 50 | 70.8 | TRUE |
| Eve | Parker | Eve Parker | Female | 21 | 67.9 | TRUE |
| John | Evans | John Evans | Male | 35 | 75.3 | FALSE |
| Mary | Davis | Mary Davis | Female | 45 | 61.9 | TRUE |
| Peter | Baker | Peter Baker | Male | 28 | 72.4 | FALSE |
| Paul | Daniels | Paul Daniels | Male | 31 | 69.9 | FALSE |
| Joanna | Edwards | Joanna Edwards | Female | 42 | 63.5 | FALSE |
| Matthew | Smith | Matthew Smith | Male | 33 | 71.5 | TRUE |
| David | Roberts | David Roberts | Male | 57 | 73.2 | FALSE |
| Sally | Wilson | Sally Wilson | Female | 62 | 64.8 | TRUE |
age <- c(50, 21, 35, 45, 28, 31, 42, 33, 57, 62)
weight <- c(70.8, 67.9, 75.3, 61.9, 72.4, 69.9,
63.5, 71.5, 73.2, 64.8)
firstName <- c("Adam", "Eve", "John", "Mary",
"Peter", "Paul", "Joanna", "Matthew",
"David", "Sally")
secondName <- c("Jones", "Parker", "Evans", "Davis",
"Baker","Daniels", "Edwards", "Smith",
"Roberts", "Wilson")
TRUE and FALSE:consent <- c(TRUE, TRUE, FALSE, TRUE, FALSE,
FALSE, FALSE, TRUE, FALSE, TRUE)
c(20, "a string", TRUE)
[1] "20" "a string" "TRUE"
class() function: class(firstName)
[1] "character"
class(age)
[1] "numeric"
class(weight)
[1] "numeric"
class(consent)
[1] "logical"
sex <- c("Male", "Female", "Male", "Female", "Male",
"Male", "Female", "Male", "Male", "Female")
sex
[1] "Male" "Female" "Male" "Female" "Male"
[6] "Male" "Female" "Male" "Male" "Female"
factor(sex)
[1] Male Female Male Female Male Male
[7] Female Male Male Female
Levels: Female Male
paste() function joins character vectors together)patients <- data.frame(firstName, secondName,
paste(firstName, secondName),
sex, age, weight, consent)
patients
firstName secondName paste.firstName..secondName. ...
1 Adam Jones Adam Jones
2 Eve Parker Eve Parker
3 John Evans John Evans
4 Mary Davis Mary Davis
5 Peter Baker Peter Baker
6 Paul Daniels Paul Daniels
7 ...
$’ operator:patients$age
paste() command)names() function, and we can use the same function to see the names:names(patients) <- c("First_Name", "Second_Name",
"Full_Name", "Sex", "Age",
"Weight", "Consent")
names(patients)
[1] "First_Name" "Second_Name" "Full_Name" "Sex"
[5] "Age" "Weight" "Consent"
patients <- data.frame(First_Name = firstName,
Second_Name = secondName,
Full_Name = paste(firstName, secondName),
Sex = sex,
Age = age,
Weight = weight,
Consent = consent)
names(patients)
[1] "First_Name" "Second_Name" "Full_Name" "Sex"
[5] "Age" "Weight" "Consent"
patients$First_Name
factor():patients <- data.frame(First_Name = firstName,
Second_Name = secondName,
Full_Name = paste(firstName, secondName),
Sex = factor(sex),
Age = age,
Weight = weight,
Consent = consent,
stringsAsFactors = FALSE)
patients$Sex
[1] Male Female Male Female Male Male Female Male
[9] Male Female
Levels: Female Male
patients$First_Name
[1] "Adam" "Eve" "John" "Mary" "Peter"
[6] "Paul" "Joanna" "Matthew" "David" "Sally"
All columns are assumed to contain the same data type, e.g. numerical
e <- matrix(1:10, nrow=5, ncol=2)
e
[,1] [,2]
[1,] 1 6
[2,] 2 7
[3,] 3 8
[4,] 4 9
[5,] 5 10
rowMeans(e)
[1] 3.5 4.5 5.5 6.5 7.5
object[rows, colums]e[1,2]
[1] 6
e[1,]
[1] 1 6
patients[1,2]
[1] "Jones"
patients[1,]
First_Name Second_Name Full_Name Sex Age Weight Consent
1 Adam Jones Adam Jones Male 50 70.8 TRUE
s <- letters[1:5]
s
[1] "a" "b" "c" "d" "e"
# View some of the values in s:
s[c(1,3)]
[1] "a" "c"
s[c(TRUE, FALSE, TRUE, FALSE, FALSE)]
[1] "a" "c"
TRUE and FALSE values to subset the vectora <- 1:5
# Logical tests:
a < 3
[1] TRUE TRUE FALSE FALSE FALSE
s[a < 3]
[1] "a" "b"
<, >, <=, >=, ==, !=!, &, |, xor
TRUE, FALSE)s
[1] "a" "b" "c" "d" "e"
a
[1] 1 2 3 4 5
# More logical tests:
s[a > 1 & a <3]
[1] "b"
s[a == 2]
[1] "b"
country, continent, and height
country a character vector but continent a factorsummary() function on your data frame. What does it do? How does it treat vectors (numeric, character, logical) and factors? (What does it do for matrices?)patients[patients$Age < 40, ]
patients[patients$Consent == TRUE, ]
patients[patients$Sex=="Male" & patients$Weight>=70.8, ]
read.table()read.csv(), read.delim()write.table()write.csv()wd)setwd("/home/participant/Course_Materials")
Before we even start the analysis, we need to be sure of where the data are located on our hard drive
getwd()
list.files()
file.choose()rawdata)rawData <- read.delim("countData.txt")
file.choose():myfile <- file.choose()
rawData <- read.delim(myfile)
sep="," or the function read.csv():read.csv("countData.csv")
?read.table
Always check the object to make sure the contents and dimensions are as you expect
# View the first 10 rows to ensure import is OK
rawData[1:10,]
View() function to get a neat display of the data:View(rawData)
Once we have read the data successfully, we can start to interact with it
The object we have created is a data frame:
class(rawData)
[1] "data.frame"
ncol(rawData)
[1] 5
nrow(rawData)
[1] 50
dim(rawData)
[1] 50 5
str(rawData)
'data.frame': 50 obs. of 5 variables:
$ Patient: int 1 2 3 4 5 6 7 8 9 10 ...
$ Nuclei : int 65 51 37 37 45 46 65 59 49 46 ...
$ NB_Amp : int 0 3 2 2 2 4 1 1 0 0 ...
$ NB_Nor : int 63 43 35 35 42 41 64 54 48 45 ...
$ NB_Del : int 2 5 0 0 1 1 0 4 1 1 ...
colnames(rawData)
[1] "Patient" "Nuclei" "NB_Amp" "NB_Nor" "NB_Del"
rawData$Nuclei
Like families, tidy datasets are all alike but every messy dataset is messy in its own way - (Hadley Wickham)
NA values, which means the values are missing – a common occurrence in real data collectionNA is a special value that can be present in objects of any type (logical, character, numeric etc)NA is not the same as NULL:
NULL is an empty R object.NA is one missing value within an R object (like a data frame or a vector)NAs gracefully:x <- c(1, NA, 3)
length(x)
[1] 3
NAs, and functions often have their own arguments (like na.rm) for handling them:mean(x, na.rm = TRUE)
[1] 2
mean(na.omit(x))
[1] 2
which() function to select indices from a logical vector that are TRUE# Create an index of results:
prop <- rawData$NB_Amp / rawData$Nuclei
# Find which observations have a proportion higher than 33%:
prop > 0.33
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[10] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[19] FALSE FALSE FALSE FALSE TRUE FALSE NA FALSE NA
[28] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE NA FALSE FALSE
[46] FALSE FALSE FALSE FALSE FALSE
# Get sample names of amplified patients:
amp <- which(prop > 0.33)
amp
[1] 11 23
plot(prop, ylim=c(0,1))
# Add a horizonal line:
abline(h=0.33)
write.csv(rawData[amp,], file="selectedSamples.csv")
getwd() # Print working directory
list.files() # List files in working directory
(NB_Amp / Nuclei < 0.33 & NB_Del == 0) (where NB_Del is the number of deletions)norm <- which(prop < 0.33 & rawData$NB_Del == 0)
norm
[1] 3 4 7 15 20 24 36 37 42 47
write.csv(rawData[norm,], "My_NB_output.csv")
lattice, ggplot2plot() will make a scatter plot with the values of the vector on the y axis, and indices in the x axis
patients$Weight
[1] 70.8 67.9 75.3 61.9 72.4 69.9 63.5 71.5 73.2 64.8
plot(patients$Weight)
plot():
patients$Age
[1] 50 21 35 45 28 31 42 33 57 62
plot(patients$Age, patients$Weight)
plot() functionbarplot()barplot(patients$Age)
barplot(summary(patients$Sex))
freq argument)hist(patients$Weight)
boxplot(patients$Weight, horizontal = TRUE)
y ~ x means put continuous variable y on the y axis and categorical x on the x axisboxplot(patients$Weight ~ patients$Sex)
example(dotchart)example(stripchart)example(vioplot) # From vioplot libraryexample(beeswarm) # From beeswarm libraryozone.csv:
weather <- read.csv("ozone.csv")
View(weather)
| Ozone | Solar.R | Wind | Temp | Month | Day |
|---|---|---|---|---|---|
| 41 | 190 | 7.4 | 67 | 5 | 1 |
| 36 | 118 | 8.0 | 72 | 5 | 2 |
| 12 | 149 | 12.6 | 74 | 5 | 3 |
| 18 | 313 | 11.5 | 62 | 5 | 4 |
| NA | NA | 14.3 | 56 | 5 | 5 |
| 28 | NA | 14.9 | 66 | 5 | 6 |
plot(weather$Solar.R, weather$Ozone)
hist(weather$Temp)
boxplot(weather$Ozone)
boxplot(weather$Ozone ~ weather$Month)
plot() comes with a large collection of arguments that can be set when we call the function:
?plot and ?parplot(patients$Weight, type = "l")
plot(patients$Weight, type = "b")
main argument:plot(patients$Age, patients$Weight,
main = "Relationship between Weight and Age")
plot(patients$Age, patients$Weight, xlab = "Age")
plot(patients$Age, patients$Weight, ylab = "Weight")
ylim and xlim are used to specify axis limitsplot(patients$Age,patients$Weight,
ylab="Weight",
xlab="Age",
main="Relationship between Weight and Age",
xlim=c(10,70),
ylim=c(60,80))
"red", "orange","green","blue","yellow"…hotpink1, gray44, orangered, grey94, lightgoldenrod4, lightsalmon3, thistle1, darkgoldenrod3…
colours()rgb(0.7, 0.7, 0.7) → A light grey in RGB values"#B3B3B3" → The same light grey ,in hexadecimal"#0000FF88" → A semi-transparent blue, in hexadecimal
Changing the col argument to plot() changes the colour that the points are plotted in:
plot(patients$Age, patients$Weight, col = "red")
plot(patients$Age, patients$Weight, pch = 16)
plot(patients$Age, patients$Weight, pch = "X")
Character expansion:
plot(patients$Age, patients$Weight, cex = 3)
Character expansion:
plot(patients$Age, patients$Weight, cex = 0.2)
plot(patients$Age, patients$Weight,
pch = 1:10, cex = 1:5,
col = c("red", "orange", "green", "blue"))
plot()
boxplot(patients$Weight~patients$Sex,
xlab = "Sex",
ylab = "Weight",
main = "Relationship between Weight and Gender",
col = c("blue","yellow"))
breaks and freq arguments to hist (?hist) to create 50 bins and display density rather than frequency?rainbow)plot(weather$Solar.R, weather$Ozone, col="orange", pch=16,
ylab="Ozone level", xlab="Solar Radiation",
main="Relationship between ozone level and
solar radiation")
hist(weather$Temp, col="purple", xlab="Temperature",
main="Distribution of Temperature", breaks = 50:100,
freq=FALSE)
rainbow() function is used to create a vector of colours for the boxplot; in other words a palette:
heat.colors(), terrain.colors(), topo.colors(), cm.colors()boxplot(weather$Ozone ~ weather$Month, col=rainbow(5))
RColorBrewer package:library(RColorBrewer)
display.brewer.all()
boxplot(weather$Ozone ~ weather$Month,
col=brewer.pal(5, "Set1"))